roc_auc_score (ROC AUC)#

roc_auc_score computes the area under the ROC curve. It evaluates how well a model ranks positive examples above negative examples, using scores (probabilities or decision values), not hard class labels.

You will learn

  • How thresholds produce points on the ROC curve (TPR vs FPR)

  • Two equivalent AUC formulas: trapezoid area and Mann–Whitney (rank) view

  • A NumPy implementation of roc_curve + roc_auc_score (tie-safe)

  • How to optimize for AUC with a differentiable pairwise surrogate (NumPy)

Quick import#

from sklearn.metrics import roc_auc_score

Prerequisites#

  • Binary classification labels (0/1)

  • Confusion matrix terms: TP / FP / TN / FN

  • Basic probability and calculus

import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score as skl_roc_auc_score
from sklearn.metrics import roc_curve as skl_roc_curve

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

SEED = 42
rng = np.random.default_rng(SEED)

1) From scores to TPR/FPR (one threshold)#

Assume:

  • true labels: \(y_i \in \{0,1\}\)

  • model scores (higher = more positive): \(s_i \in \mathbb{R}\)

  • threshold: \(\tau\)

We predict positive if:

\[ \hat{y}_i(\tau)=\mathbb{1}[s_i \ge \tau] \]

From the confusion matrix at threshold \(\tau\):

\[ \mathrm{TPR}(\tau)=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad \mathrm{FPR}(\tau)=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}} \]

  • TPR = recall / sensitivity

  • FPR = 1 - specificity

def confusion_at_threshold(y_true, y_score, threshold, pos_label=1):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must have the same length.")

    pos = y_true == pos_label
    pred_pos = y_score >= threshold

    tp = np.sum(pos & pred_pos)
    fp = np.sum(~pos & pred_pos)
    fn = np.sum(pos & ~pred_pos)
    tn = np.sum(~pos & ~pred_pos)
    return tp, fp, tn, fn


def tpr_fpr_from_confusion(tp, fp, tn, fn):
    tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan
    return tpr, fpr
y_true_small = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_score_small = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.52, 0.50, 0.40, 0.35, 0.30, 0.10])

threshold = 0.50

tp, fp, tn, fn = confusion_at_threshold(y_true_small, y_score_small, threshold=threshold)
tpr, fpr = tpr_fpr_from_confusion(tp, fp, tn, fn)

df_small = pd.DataFrame({"y_true": y_true_small, "score": y_score_small})
df_small = df_small.sort_values("score", ascending=False).reset_index(drop=True)
df_small[f"y_pred(score \u2265 {threshold:.2f})"] = (df_small["score"] >= threshold).astype(int)

df_small, {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "TPR": tpr, "FPR": fpr}
(    y_true  score  y_pred(score ≥ 0.50)
 0        1   0.95                     1
 1        0   0.90                     1
 2        1   0.80                     1
 3        0   0.75                     1
 4        1   0.60                     1
 5        0   0.55                     1
 6        0   0.52                     1
 7        1   0.50                     1
 8        0   0.40                     0
 9        1   0.35                     0
 10       0   0.30                     0
 11       0   0.10                     0,
 {'TP': 4, 'FP': 4, 'TN': 3, 'FN': 1, 'TPR': 0.8, 'FPR': 0.5714285714285714})

2) ROC curve: sweep the threshold#

The ROC curve plots \((\mathrm{FPR}(\tau), \mathrm{TPR}(\tau))\) as we move the threshold \(\tau\) from very strict to very lenient:

  • \(\tau = +\infty\) ⇒ predict nothing positive ⇒ (FPR,TPR) = (0,0)

  • \(\tau\) decreases ⇒ more predicted positives ⇒ move up/right

  • \(\tau = -\infty\) ⇒ predict everything positive ⇒ (1,1)

A random ranking gives the diagonal line \(\mathrm{TPR} = \mathrm{FPR}\).

def roc_curve_bruteforce(y_true, y_score, pos_label=1):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    thresholds = np.r_[np.inf, np.sort(np.unique(y_score))[::-1]]
    fpr = []
    tpr = []

    for thr in thresholds:
        tp, fp, tn, fn = confusion_at_threshold(y_true, y_score, thr, pos_label=pos_label)
        tpr_i, fpr_i = tpr_fpr_from_confusion(tp, fp, tn, fn)
        fpr.append(fpr_i)
        tpr.append(tpr_i)

    return np.asarray(fpr), np.asarray(tpr), thresholds


fpr_b, tpr_b, thr_b = roc_curve_bruteforce(y_true_small, y_score_small)
auc_b = np.trapz(tpr_b, fpr_b)  # np.trapz is named np.trapezoid in NumPy >= 2.0

df_roc_small = pd.DataFrame({"threshold": thr_b, "fpr": fpr_b, "tpr": tpr_b})
df_roc_small
threshold fpr tpr
0 inf 0.000000 0.0
1 0.95 0.000000 0.2
2 0.90 0.142857 0.2
3 0.80 0.142857 0.4
4 0.75 0.285714 0.4
5 0.60 0.285714 0.6
6 0.55 0.428571 0.6
7 0.52 0.571429 0.6
8 0.50 0.571429 0.8
9 0.40 0.714286 0.8
10 0.35 0.714286 1.0
11 0.30 0.857143 1.0
12 0.10 1.000000 1.0
point_labels = ["inf" if np.isinf(t) else f"{t:.2f}" for t in thr_b]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("ROC curve (toy example)", "TPR/FPR vs threshold"),
)

fig.add_trace(
    go.Scatter(
        x=fpr_b,
        y=tpr_b,
        mode="lines+markers",
        name=f"ROC (AUC={auc_b:.3f})",
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        name="random",
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=fpr_b,
        y=tpr_b,
        mode="markers+text",
        text=point_labels,
        textposition="top center",
        marker=dict(size=8),
        name="thresholds",
    ),
    row=1,
    col=1,
)

mask = np.isfinite(thr_b)
fig.add_trace(
    go.Scatter(
        x=thr_b[mask],
        y=tpr_b[mask],
        mode="lines+markers",
        name="TPR",
    ),
    row=1,
    col=2,
)
fig.add_trace(
    go.Scatter(
        x=thr_b[mask],
        y=fpr_b[mask],
        mode="lines+markers",
        name="FPR",
    ),
    row=1,
    col=2,
)

fig.update_xaxes(title_text="FPR", range=[0, 1], row=1, col=1)
fig.update_yaxes(title_text="TPR", range=[0, 1], row=1, col=1)
fig.update_xaxes(title_text="threshold τ", autorange="reversed", row=1, col=2)
fig.update_yaxes(title_text="rate", range=[0, 1], row=1, col=2)

fig.update_layout(width=950, height=430)
fig.show()

3) AUC: “area” and “probability of correct ranking”#

The ROC AUC is the area under the ROC curve:

\[ \mathrm{AUC} = \int_0^1 \mathrm{TPR}(u)\,du \]

where we integrate TPR as a function of FPR.

A powerful equivalent view (binary case) is:

\[ \mathrm{AUC} = \mathbb{P}(s^+ > s^-) + \frac{1}{2}\mathbb{P}(s^+ = s^-) \]

where \(s^+\) is the score of a random positive example and \(s^-\) is the score of a random negative example.

So AUC is a ranking metric:

  • any strictly monotonic transform of the score (e.g. logits → probabilities) leaves AUC unchanged

  • AUC = 0.5 means random ranking, AUC = 1.0 means perfect ranking

pos_scores = y_score_small[y_true_small == 1]
neg_scores = y_score_small[y_true_small == 0]

auc_pairwise = (pos_scores[:, None] > neg_scores[None, :]).mean() + 0.5 * (
    pos_scores[:, None] == neg_scores[None, :]
).mean()

auc_pairwise, auc_b
(0.6571428571428571, 0.6571428571428571)
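The rank interpretation can also be checked by Monte Carlo: sample random positive/negative pairs and count the fraction ranked correctly. A minimal self-contained sketch (re-creating the toy arrays so the cell runs on its own):

```python
import numpy as np

# Toy arrays from the example above, repeated so this cell is self-contained
y_true_small = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_score_small = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55,
                          0.52, 0.50, 0.40, 0.35, 0.30, 0.10])

pos_scores = y_score_small[y_true_small == 1]
neg_scores = y_score_small[y_true_small == 0]

# Sample many random (positive, negative) pairs; AUC is the fraction ranked
# correctly (ties would count 1/2, but this toy data has no ties)
rng_mc = np.random.default_rng(0)
n_pairs = 200_000
s_pos = rng_mc.choice(pos_scores, size=n_pairs, replace=True)
s_neg = rng_mc.choice(neg_scores, size=n_pairs, replace=True)

auc_mc = (s_pos > s_neg).mean() + 0.5 * (s_pos == s_neg).mean()
auc_mc  # ≈ 23/35 ≈ 0.657, matching the exact pairwise value above
```

With 5 positives and 7 negatives there are only 35 pairs, of which 23 are ranked correctly, so the estimate converges to 23/35.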
n_pairs = 500
pos_s = rng.choice(pos_scores, size=n_pairs, replace=True)
neg_s = rng.choice(neg_scores, size=n_pairs, replace=True)

df_pairs = pd.DataFrame({"neg_score": neg_s, "pos_score": pos_s})

min_s = float(min(df_pairs["neg_score"].min(), df_pairs["pos_score"].min()))
max_s = float(max(df_pairs["neg_score"].max(), df_pairs["pos_score"].max()))

fig = px.scatter(
    df_pairs,
    x="neg_score",
    y="pos_score",
    opacity=0.55,
    title=(
        "Random positive/negative score pairs (above diagonal = correct ranking)" f"<br>AUC ≈ {auc_pairwise:.3f}"
    ),
)
fig.add_shape(
    type="line",
    x0=min_s,
    y0=min_s,
    x1=max_s,
    y1=max_s,
    line=dict(color="black", dash="dash"),
)
fig.update_xaxes(title="negative score s⁻")
fig.update_yaxes(title="positive score s⁺")
fig.update_layout(width=650, height=520)
fig.show()

4) NumPy implementation (ROC curve + ROC AUC)#

A direct implementation by scanning all thresholds can be \(O(n^2)\).

A faster approach:

  1. Sort examples by score (descending)

  2. Sweep the threshold from high to low

  3. Track cumulative TP and FP counts

  4. Record a ROC point only when the score changes (tie handling)

This is \(O(n \log n)\) due to sorting.

def roc_curve_np(y_true, y_score, pos_label=1):
    """Compute ROC curve points (FPR, TPR) for binary classification.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Binary labels. Anything equal to `pos_label` is treated as positive.
    y_score : array-like of shape (n_samples,)
        Scores where larger means "more positive".
    pos_label : label (default=1)
        Which label is considered positive.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must have the same length.")

    pos = y_true == pos_label
    n_pos = int(pos.sum())
    n_neg = int((~pos).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("roc_curve is undefined with only one class present in y_true.")

    order = np.argsort(-y_score, kind="mergesort")
    y_score_sorted = y_score[order]
    y_pos_sorted = pos[order].astype(int)

    distinct_value_indices = np.where(np.diff(y_score_sorted))[0]
    threshold_idxs = np.r_[distinct_value_indices, y_pos_sorted.size - 1]

    tps = np.cumsum(y_pos_sorted)[threshold_idxs]
    fps = 1 + threshold_idxs - tps

    # Prepend the point at threshold +inf: (FPR,TPR) = (0,0)
    tps = np.r_[0, tps]
    fps = np.r_[0, fps]
    thresholds = np.r_[np.inf, y_score_sorted[threshold_idxs]]

    fpr = fps / n_neg
    tpr = tps / n_pos
    return fpr, tpr, thresholds


def roc_auc_score_np(y_true, y_score, pos_label=1):
    fpr, tpr, _ = roc_curve_np(y_true, y_score, pos_label=pos_label)
    return float(np.trapz(tpr, fpr))  # np.trapezoid in NumPy >= 2.0


def rankdata_average_ties(x):
    """Ranks starting at 1, using average ranks for ties (NumPy-only)."""
    x = np.asarray(x)
    order = np.argsort(x, kind="mergesort")
    x_sorted = x[order]

    ranks_sorted = np.empty_like(x_sorted, dtype=float)

    n = len(x_sorted)
    i = 0
    rank = 1
    while i < n:
        j = i + 1
        while j < n and x_sorted[j] == x_sorted[i]:
            j += 1

        # ranks for i..j-1 are rank..rank+(j-i)-1
        avg_rank = 0.5 * (rank + (rank + (j - i) - 1))
        ranks_sorted[i:j] = avg_rank

        rank += j - i
        i = j

    ranks = np.empty_like(ranks_sorted)
    ranks[order] = ranks_sorted
    return ranks


def roc_auc_score_mann_whitney_np(y_true, y_score, pos_label=1):
    """AUC via Mann–Whitney U / Wilcoxon rank-sum (tie-safe)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must have the same length.")

    pos = y_true == pos_label
    n_pos = int(pos.sum())
    n_neg = int((~pos).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("roc_auc_score is undefined with only one class present in y_true.")

    ranks = rankdata_average_ties(y_score)
    sum_ranks_pos = ranks[pos].sum()
    u = sum_ranks_pos - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))
y_true = rng.integers(0, 2, size=300)
y_score = rng.normal(size=300)

auc_np = roc_auc_score_np(y_true, y_score)
auc_mw = roc_auc_score_mann_whitney_np(y_true, y_score)
auc_skl = skl_roc_auc_score(y_true, y_score)

auc_np, auc_mw, auc_skl
(0.4551141695339381, 0.4551141695339381, 0.4551141695339381)
# Our curve matches sklearn when drop_intermediate=False (sklearn defaults to drop_intermediate=True)
fpr_np, tpr_np, thr_np = roc_curve_np(y_true, y_score)
fpr_skl, tpr_skl, thr_skl = skl_roc_curve(y_true, y_score, drop_intermediate=False)

(
    np.allclose(fpr_np, fpr_skl),
    np.allclose(tpr_np, tpr_skl),
    np.allclose(thr_np, thr_skl),
    len(fpr_np),
    len(skl_roc_curve(y_true, y_score)[0]),
)
(True, True, True, 301, 171)
# AUC is invariant to strictly monotonic transforms of the scores
auc_logits = roc_auc_score_np(y_true, y_score)
auc_prob = roc_auc_score_np(y_true, 1 / (1 + np.exp(-y_score)))

auc_logits, auc_prob
(0.4551141695339381, 0.4551141695339381)

5) Visual intuition: distributions → thresholds → ROC points#

Below we draw score distributions for each class and place a few thresholds. Each threshold maps to a point on the ROC curve.

n_pos, n_neg = 250, 750
scores_pos = rng.normal(loc=1.2, scale=1.0, size=n_pos)
scores_neg = rng.normal(loc=0.0, scale=1.0, size=n_neg)

y_true_big = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score_big = np.r_[scores_pos, scores_neg]

perm = rng.permutation(len(y_true_big))
y_true_big = y_true_big[perm]
y_score_big = y_score_big[perm]

fpr, tpr, thresholds = roc_curve_np(y_true_big, y_score_big)
auc_val = roc_auc_score_np(y_true_big, y_score_big)

thresholds_demo = np.quantile(y_score_big, [0.9, 0.5, 0.1])
colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Score distributions", f"ROC curve (AUC={auc_val:.3f})"),
)

fig.add_trace(
    go.Histogram(
        x=y_score_big[y_true_big == 0],
        name="negative",
        opacity=0.6,
        nbinsx=40,
        marker_color="gray",
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Histogram(
        x=y_score_big[y_true_big == 1],
        name="positive",
        opacity=0.6,
        nbinsx=40,
        marker_color="crimson",
    ),
    row=1,
    col=1,
)

for thr, c in zip(thresholds_demo, colors):
    fig.add_vline(x=float(thr), line_dash="dash", line_color=c, row=1, col=1)

fig.add_trace(go.Scatter(x=fpr, y=tpr, mode="lines", name="ROC"), row=1, col=2)
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        name="random",
    ),
    row=1,
    col=2,
)

for thr, c in zip(thresholds_demo, colors):
    tp, fp, tn, fn = confusion_at_threshold(y_true_big, y_score_big, threshold=float(thr))
    tpr_thr, fpr_thr = tpr_fpr_from_confusion(tp, fp, tn, fn)
    fig.add_trace(
        go.Scatter(
            x=[fpr_thr],
            y=[tpr_thr],
            mode="markers",
            marker=dict(size=10, color=c),
            name=f"τ={thr:.2f}",
        ),
        row=1,
        col=2,
    )

fig.update_layout(barmode="overlay", width=950, height=430)
fig.update_xaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="count", row=1, col=1)
fig.update_xaxes(title_text="FPR", range=[0, 1], row=1, col=2)
fig.update_yaxes(title_text="TPR", range=[0, 1], row=1, col=2)

fig.show()

6) Class imbalance: ROC AUC is prevalence-invariant (PR AUC is not)#

ROC uses rates (TPR/FPR), so duplicating every negative example (same scores) leaves the curve and AUC unchanged.

Precision–recall metrics do change with prevalence, which is why PR AUC is often the more informative summary under extreme imbalance: it reflects how precision degrades at the actual base rate.

# Duplicate negatives 10x (same scores) to change prevalence
y_true_imbal = np.r_[y_true_big[y_true_big == 1], np.repeat(y_true_big[y_true_big == 0], 10)]
y_score_imbal = np.r_[y_score_big[y_true_big == 1], np.repeat(y_score_big[y_true_big == 0], 10)]

auc_orig = roc_auc_score_np(y_true_big, y_score_big)
auc_imbal = roc_auc_score_np(y_true_imbal, y_score_imbal)

ap_orig = average_precision_score(y_true_big, y_score_big)
ap_imbal = average_precision_score(y_true_imbal, y_score_imbal)

auc_orig, auc_imbal, ap_orig, ap_imbal
(0.8279306666666666,
 0.8279306666666666,
 0.6333778374447112,
 0.2010312536095087)
fpr_o, tpr_o, _ = roc_curve_np(y_true_big, y_score_big)
fpr_i, tpr_i, _ = roc_curve_np(y_true_imbal, y_score_imbal)

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr_o, y=tpr_o, mode="lines", name=f"original (AUC={auc_orig:.3f})"))
fig.add_trace(go.Scatter(x=fpr_i, y=tpr_i, mode="lines", name=f"negatives ×10 (AUC={auc_imbal:.3f})"))
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        showlegend=False,
    )
)
fig.update_layout(
    title="ROC curves overlap under prevalence shift",
    xaxis_title="FPR",
    yaxis_title="TPR",
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
    width=720,
    height=450,
)
fig.show()

7) Practical usage (scikit-learn)#

Key points:

  • Pass scores, not hard labels.

    • predict_proba(X)[:, 1] (probabilities)

    • decision_function(X) (raw scores / logits)

  • Any monotonic transform of scores gives the same AUC.

  • For multiclass you must choose multi_class="ovr" or "ovo" and an averaging strategy.

Docs: sklearn.metrics.roc_auc_score.
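For the multiclass case, a minimal sketch (the dataset and model here are illustrative assumptions, not part of the experiment below): pass the full probability matrix from predict_proba and choose the strategy explicitly.

```python
# Multiclass sketch (illustrative data): roc_auc_score needs the full
# (n_samples, n_classes) probability matrix plus an explicit multi_class choice
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

Xm, ym = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
clf_m = LogisticRegression(max_iter=2000).fit(Xm, ym)
proba_m = clf_m.predict_proba(Xm)  # shape (600, 3), rows sum to 1

auc_ovr = roc_auc_score(ym, proba_m, multi_class="ovr", average="macro")
auc_ovo = roc_auc_score(ym, proba_m, multi_class="ovo", average="macro")
auc_ovr, auc_ovo
```

"ovr" averages one-vs-rest AUCs per class; "ovo" averages over all class pairs and is less sensitive to class imbalance.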

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    weights=[0.85, 0.15],
    class_sep=1.0,
    random_state=SEED,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)

score_logit = clf.decision_function(X_test)
score_proba = clf.predict_proba(X_test)[:, 1]
score_label = clf.predict(X_test)

# logits and probabilities have identical ranking → identical AUC;
# hard labels collapse the ranking to two values, so their AUC is much lower
skl_roc_auc_score(y_test, score_logit), skl_roc_auc_score(y_test, score_proba), skl_roc_auc_score(y_test, score_label)
(0.7989558370421088, 0.7989558370421088, 0.5415525504964054)

8) Optimizing for ROC AUC (NumPy)#

For binary labels, AUC can be written as an average over all positive–negative pairs:

\[ \mathrm{AUC}(s) = \frac{1}{|P||N|} \sum_{i\in P}\sum_{j\in N} \Big(\mathbb{1}[s_i > s_j] + \tfrac{1}{2}\mathbb{1}[s_i = s_j]\Big) \]

This depends on pairwise orderings (rankings), which makes it:

  • non-decomposable over single examples

  • non-differentiable because of the indicator

A common workaround is to optimize a smooth pairwise surrogate. For a linear scoring model \(s(x)=w^\top x\) one choice is the pairwise logistic loss:

\[ L(w)=\frac{1}{|P||N|}\sum_{i\in P}\sum_{j\in N} \log\big(1+\exp\big(-(s_i - s_j)\big)\big) \]

Minimizing \(L\) encourages \(s_i > s_j\) for positive \(i\) and negative \(j\), i.e. better AUC.

In practice we sample pairs (SGD) instead of enumerating all \(|P||N|\) pairs.

def sigmoid(z):
    z = np.asarray(z)
    z = np.clip(z, -40, 40)
    return 1 / (1 + np.exp(-z))


def add_bias(X):
    X = np.asarray(X)
    return np.c_[np.ones(X.shape[0]), X]


def make_gaussian_binary(n_pos=250, n_neg=1250, seed=0):
    rng_local = np.random.default_rng(seed)
    mean_pos = np.array([1.5, 1.5])
    mean_neg = np.array([0.0, 0.0])
    cov = np.array([[1.0, 0.3], [0.3, 1.0]])

    X_pos = rng_local.multivariate_normal(mean_pos, cov, size=n_pos)
    X_neg = rng_local.multivariate_normal(mean_neg, cov, size=n_neg)

    X = np.vstack([X_pos, X_neg])
    y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]

    perm = rng_local.permutation(len(y))
    return X[perm], y[perm]


def train_logistic_logloss_gd(X, y, lr=0.2, steps=2000, l2=1e-3, log_every=50):
    Xb = add_bias(X)
    y = y.astype(float)

    w = np.zeros(Xb.shape[1])
    hist = []

    for step in range(steps + 1):
        scores = Xb @ w
        p = sigmoid(scores)

        grad = (Xb.T @ (p - y)) / len(y)
        reg_grad = l2 * np.r_[0.0, w[1:]]  # don't regularize bias
        w -= lr * (grad + reg_grad)

        if step % log_every == 0:
            logloss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean()
            auc = roc_auc_score_np(y.astype(int), scores)
            hist.append({"step": step, "logloss": logloss, "train_auc": auc})

    return w, pd.DataFrame(hist)


def train_auc_pairwise_sgd(
    X, y, lr=0.2, steps=4000, batch_pairs=512, l2=1e-3, log_every=50, seed=0
):
    rng_local = np.random.default_rng(seed)
    Xb = add_bias(X)
    y = y.astype(int)

    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) == 0 or len(neg_idx) == 0:
        raise ValueError("Need both classes to optimize AUC.")

    w = np.zeros(Xb.shape[1])
    hist = []

    for step in range(steps + 1):
        i = rng_local.choice(pos_idx, size=batch_pairs, replace=True)
        j = rng_local.choice(neg_idx, size=batch_pairs, replace=True)

        delta = Xb[i] - Xb[j]  # x_i - x_j
        d = delta @ w  # (w^T x_i) - (w^T x_j)

        # loss = log(1 + exp(-d))
        # dloss/dd = -sigmoid(-d)
        grad = -(sigmoid(-d)[:, None] * delta).mean(axis=0)

        reg_grad = l2 * np.r_[0.0, w[1:]]
        w -= lr * (grad + reg_grad)

        if step % log_every == 0:
            scores = Xb @ w
            auc = roc_auc_score_np(y, scores)
            pair_loss = np.log1p(np.exp(-d)).mean()
            hist.append({"step": step, "pair_loss": pair_loss, "train_auc": auc})

    return w, pd.DataFrame(hist)
X, y = make_gaussian_binary(seed=SEED)

# manual split (stratified-ish via shuffling; dataset is large enough here)
idx = rng.permutation(len(y))
n_train = int(0.7 * len(y))
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

w_ce, hist_ce = train_logistic_logloss_gd(X_train, y_train, lr=0.3, steps=2000, log_every=50)
w_auc, hist_auc = train_auc_pairwise_sgd(
    X_train, y_train, lr=0.3, steps=3000, batch_pairs=1024, log_every=50, seed=SEED
)

scores_ce_test = add_bias(X_test) @ w_ce
scores_auc_test = add_bias(X_test) @ w_auc

auc_ce_test = roc_auc_score_np(y_test, scores_ce_test)
auc_auc_test = roc_auc_score_np(y_test, scores_auc_test)

auc_ce_test, auc_auc_test
(0.9139835858585859, 0.9127604166666666)
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=hist_ce["step"],
        y=hist_ce["train_auc"],
        mode="lines",
        name="log-loss GD (train AUC)",
    )
)
fig.add_trace(
    go.Scatter(
        x=hist_auc["step"],
        y=hist_auc["train_auc"],
        mode="lines",
        name="pairwise AUC surrogate (train AUC)",
    )
)
fig.update_layout(
    title="Training AUC over iterations",
    xaxis_title="step",
    yaxis_title="ROC AUC",
    yaxis=dict(range=[0, 1]),
    width=760,
    height=430,
)
fig.show()
fpr_ce, tpr_ce, _ = roc_curve_np(y_test, scores_ce_test)
fpr_auc, tpr_auc, _ = roc_curve_np(y_test, scores_auc_test)

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=fpr_ce,
        y=tpr_ce,
        mode="lines",
        name=f"log-loss GD (test AUC={auc_ce_test:.3f})",
    )
)
fig.add_trace(
    go.Scatter(
        x=fpr_auc,
        y=tpr_auc,
        mode="lines",
        name=f"AUC surrogate (test AUC={auc_auc_test:.3f})",
    )
)
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        showlegend=False,
    )
)
fig.update_layout(
    title="Test ROC curves",
    xaxis_title="FPR",
    yaxis_title="TPR",
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
    width=760,
    height=450,
)
fig.show()

Pros / cons / when to use#

Pros

  • Threshold-free: summarizes performance across all thresholds

  • Ranking-focused: \(\mathbb{P}(s^+ > s^-)\) interpretation is often intuitive

  • Invariant to monotonic score transforms (logits vs probabilities)

  • Less sensitive to class imbalance than accuracy (uses normalized rates)

Cons / pitfalls

  • Not about calibration: probabilities can be badly calibrated and still yield high AUC

  • Weights all FPR regions equally; if you care about tiny FPR, consider partial AUC

  • For extreme imbalance, PR AUC can be more informative than ROC AUC

  • Undefined if y_true contains only one class; multiclass requires design choices (ovr/ovo, averaging)

Good for

  • Model comparison when you care about ranking / screening

  • Imbalanced classification when you want a threshold-independent ranking metric

Less good for

  • Picking a single operating threshold under asymmetric costs

  • Measuring probability quality (use log-loss, Brier score, calibration curves)
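The partial-AUC caveat above can be probed without a custom implementation: scikit-learn's roc_auc_score accepts a max_fpr argument that integrates only over \([0, \texttt{max\_fpr}]\) and standardizes the result. A minimal sketch with synthetic scores (the data here is an assumption for illustration):

```python
# Partial AUC: restrict the integral to FPR <= max_fpr; sklearn standardizes
# the result (McClish correction) so that 0.5 still corresponds to random
import numpy as np
from sklearn.metrics import roc_auc_score

rng_p = np.random.default_rng(0)
y_p = rng_p.integers(0, 2, size=2000)
s_p = y_p + rng_p.normal(size=2000)  # informative but noisy scores

auc_full = roc_auc_score(y_p, s_p)
auc_low_fpr = roc_auc_score(y_p, s_p, max_fpr=0.1)  # low-FPR regime only
auc_full, auc_low_fpr
```

This is a ready-made alternative to the from-scratch partial AUC of exercise 1 below.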

Exercises#

  1. Implement partial AUC for a max FPR (e.g. integrate only over \(\mathrm{FPR}\in[0,0.1]\)).

  2. Extend roc_curve_np to support sample weights.

  3. Show numerically that AUC is unchanged by any strictly monotonic transform (try np.tanh, np.exp, sigmoid).

  4. Multiclass: compute one-vs-rest AUC for each class and compare macro vs weighted averages.

References#

  • scikit-learn roc_auc_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

  • scikit-learn ROC user guide: https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc

  • T. Fawcett (2006), An introduction to ROC analysis

  • Hanley & McNeil (1982), The meaning and use of the area under a ROC curve